ABSTRACT

In bacteria, small regulatory non-coding RNAs 
(sRNAs) are the most abundant class of post- 
transcriptional regulators. They are involved in 
diverse processes including quorum sensing, stress 
response, virulence and carbon metabolism. Recent 
developments in high-throughput techniques, such as 
genomic tiling arrays and RNA-Seq, have allowed ef- 
ficient detection and characterization of bacterial 
sRNAs. However, a comprehensive repository to 
host sRNAs and their annotations is not available. 
Existing databases suffer from a limited number of 
bacterial species or sRNAs included. In addition, 
these databases do not have tools to integrate or 
analyse high-throughput sequencing data. Here, we 
have developed BSRD (http://kwanlab.bio.cuhk.edu.hk/BSRD),
a comprehensive bacterial sRNAs
database, as a repository for published bacterial 
sRNA sequences with annotations and expression 
profiles. BSRD contains over nine times more experimentally
validated sRNAs than any other available
database. BSRD also provides combinatorial regulatory
networks of transcription factors and sRNAs with
their common targets. We have built and implemented 
in BSRD a novel RNA-Seq analysis platform, 
sRNADeep, to characterize sRNAs in large-scale tran- 
scriptome sequencing projects. We will update BSRD 
regularly. 

INTRODUCTION 

Small regulatory RNAs (sRNAs) in bacteria are a class 
of non-coding RNA genes. They are usually 50-500 bp
long, and a typical bacterial genome is estimated to
encode ~200-300 of them (1). sRNAs can
be categorized as cis-encoded antisense sRNAs, which
are completely complementary to their targets, and
trans-encoded antisense sRNAs, which are only partially
complementary to their targets, with binding facilitated by
the RNA-binding protein Hfq (2). sRNAs are important
post-transcriptional regulators: they can either inhibit
translation of mRNAs by degrading the mRNAs or 
masking the ribosome binding sites, or activate the trans- 
lation by opening the ribosome binding sites or increasing 
mRNA stability (3). They are involved in many crucial 
cellular processes, including biofilm formation (4) and
quorum sensing (5). sRNAs may directly or indirectly 
regulate most bacterial genes (6). 

Since the first discovery of the chromosome-encoded 
sRNA regulator, MicF, in Escherichia coli (7), detection 
of sRNAs by traditional genetic screening methods has
been hampered by their relatively small size,
non-protein-coding nature and location in intergenic
regions (8). As a result, only a limited number of
sRNAs had been identified. Recent advances in computational
methods and high-throughput techniques, such as
genomic tiling microarrays and deep sequencing, have
uncovered many sRNAs and provided invaluable insights
into the detection and characterization of bacterial
sRNAs.

Although sRNAs play crucial regulatory roles and their 
discovery has been greatly facilitated in recent years, the 
quantity and quality of currently available sRNA databases 
are far from desirable. Databases such as RegulonDB (9) 
and EcoCyc (10) include only sRNAs in the E. coli K12
strain. Rfam (11), which is a database of structural RNA 
families, contains only a few bacterial sRNA families. 
sRNAMap (12) contains only data from Gram-negative
strains and lacks current updates. Other databases
focus instead on sRNA targets:
sRNATarBase (13) contains exclusively experimentally
validated sRNA targets, whereas RNApredator (14) and
sRNATarget (15) are based solely on computational
predictions. Furthermore, annotations of sRNAs in most of
these databases are not up-to-date, and these databases can 
neither integrate nor analyse next-generation sequencing 
data. Therefore, a repository that collects sequences of all 
published sRNAs together with information such as their
annotations and expression profiles is needed.

Here, we present BSRD, a comprehensive bacterial 
sRNAs database, to serve as a repository for bacterial 
sRNA sequences and their annotations and expression 
profiles. In addition to sRNAs annotated in the public 
databases, we also include in BSRD manually curated 
bacterial sRNAs and their annotations from the literature. 
Besides identification of sRNAs, BSRD also provides ex- 
tensive information on functional characterizations and 
expression profiles of sRNAs. Furthermore, we have de- 
veloped and implemented in BSRD a new RNA-Seq 
analysis platform, sRNADeep, for characterization of 
sRNAs from high-throughput deep sequencing data. 

DATA COLLECTION AND CURATION 

Acquisition of sRNA sequences 

BSRD contains three kinds of sRNA sequences grouped 
according to their discovery methods: (i) by experimental 
validation, (ii) by sequence and structural conservation, 
and (iii) by RNA-Seq or tiling microarray experiments.
We have obtained 79 and 87 experimentally validated 
sRNAs from RegulonDB and sRNAMap, respectively. 
For literature mining, we first obtained a list of bacterial
strains from the NCBI taxonomy and searched the NCBI
PubMed database using the keyword 'sRNA and strain
name' with the PubCrawler program (16). sRNA information
was then extracted manually from all of the resulting
445 relevant articles. The extracted fields include sRNA name or alias,
species, physical position, strand, identification method, 
growth phase, Gene Expression Omnibus accession, 
target genes and regulation effect, and regulators. 
sRNAs were regarded as experimentally validated if
they had been identified by either Northern blot or reverse
transcription polymerase chain reaction. In total, 964
experimentally validated sRNAs were retrieved and added to
BSRD.

A total of 6266 sRNA homologs were collected from 
the Rfam database. In addition, 2334 bacterial genes
annotated as 'ncRNA' or 'antisense RNA' in the NCBI 
Gene database were also collected. An additional 310 
sRNAs from sRNAMap were also retrieved. As a result, 
a total of 8248 non-redundant sRNA homologs were 
added to BSRD. We have also obtained and added to 
BSRD 507 candidate sRNAs identified from high- 
throughput sequencing datasets. These sRNAs display 
either differential expression in various conditions or a 
high expression in a single condition. However, as
current computational prediction methods for novel
sRNAs have low precision (6-12%) and sensitivity
(20-49%) (17), datasets predicted solely in silico were
not included in BSRD. In addition, 20 115 bacterial
regulatory elements were integrated.

For sRNAs found in multiple resources, exact duplicate
hits were merged, whereas non-identical hits were kept, because
they need further verification by rapid amplification of cDNA ends
(RACE) or other experimental techniques. In BSRD, a
new sRNA nomenclature system modified from Chen's 
system (18) is used: an sRNA is indicated by an initial 's',
which stands for small RNA, followed by a three-letter
genome ID used in the KEGG database, and a number 
that indicates its genomic location. We also add an ending 
number that indicates the number of sRNAs identified in 
this location. 
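
The naming scheme described above can be sketched in a few lines of Python. This is only an illustration: the text does not specify separators or how the location number is derived, so the layout used here ('s' + genome ID + location number + '.' + ending number) and the function names make_srna_id and parse_srna_id are assumptions, and real BSRD identifiers may be formatted differently.

```python
import re


def make_srna_id(genome_id: str, location: int, copy_number: int) -> str:
    """Compose an identifier: 's' + three-letter genome ID + location + '.' + ending number.

    The '.' separator and the exact layout are assumptions made for this sketch.
    """
    if not re.fullmatch(r"[a-z]{3}", genome_id):
        raise ValueError("genome_id must be three lowercase letters")
    if location < 0 or copy_number < 1:
        raise ValueError("location must be >= 0 and copy_number >= 1")
    return f"s{genome_id}{location}.{copy_number}"


def parse_srna_id(srna_id: str) -> tuple:
    """Split an identifier built by make_srna_id back into its three parts."""
    match = re.fullmatch(r"s([a-z]{3})(\d+)\.(\d+)", srna_id)
    if match is None:
        raise ValueError(f"not a valid identifier for this sketch: {srna_id!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


# Hypothetical example: genome ID 'eco', location number 1234, first sRNA at that location.
example_id = make_srna_id("eco", 1234, 1)
example_parts = parse_srna_id(example_id)
```

Keeping the genome ID, location and ending number as separate fields means that a second sRNA found at the same location only requires incrementing the last field.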

Functional annotations of identified sRNAs 

In BSRD, each sRNA entry contains seven sections of 
descriptions: Basic Info, UCSC Browser, Secondary 
Structure, Expression Profile, Target Info, Wikipedia 
and Other Links (Figure 1). The 'Basic Info' section 
provides sRNA sequence information and information 
such as identification method, terminators, Hfq binding 
and growth phase. Positions of sRNAs can be visualized
graphically with the popular UCSC Archaeal Genome
Browser (19) implemented in BSRD. Secondary structures
of sRNAs are visualized by the RNAfold (20) and Mfold 
(21) programs. The 'Expression Profile' section provides 
expression evidence of sRNAs in different experimental 
conditions collected from the NCBI Gene Expression 
Omnibus (22) database. sRNA pathogenesis profiling 
data obtained by the recently emerging Tn-Seq approach 
(23) is also included. 

Identification of sRNA targets is an initial step to 
understand the regulatory function of sRNAs. We have 
acquired 138 sRNA-target interactions from sRNATarBase
and manually curated 56 new sRNA-target interactions
from the literature. The sRNA-target interactions
were then combined with transcription factor-target inter- 
actions to form the regulatory networks. Sigma factors, 
which act as upstream regulators to regulate sRNA tran- 
scription (24), were also added into the networks. 
Moreover, target genes of identified sRNAs predicted 
using IntaRNA (25) and RNAplex (26) were also 
provided. 

As the Wikipedia-based community annotation 
platform has been successful in Rfam and miRBase (27), 
the same platform is also implemented in BSRD. 
Wikipedia pages for all sRNA entries were reviewed
manually to exclude vandalism before being integrated into
BSRD. As most sRNAs still do not have annotation
pages in Wikipedia, a link to a brief guide for creating 
and editing a new Wikipedia page is also provided. 
Finally, we have provided cross-links to a selected list of 
external databases, including Rfam, the Gene Ontology 
(28), Sequence Ontology (29), RegulonDB, EchoBASE 
(30), EcoGene (31) and EcoCyc, for access to additional
information on the sRNAs.



BSRD INTERFACE AND FUNCTIONALITIES 

There are nine sections in the BSRD main menu: Home, 
Search BSRD, Hierarchical taxonomy, Regulatory 
network, BLAST BSRD, Download, sRNADeep, 
Submission and Latest publications. From the 'Home' 
page, a summary of numbers of sRNAs archived in the 
latest version of the database is available in the 'Current 
release' section. BSRD hosts 9579 sRNA entries from 957 
bacterial strains. Answers to the most frequently asked 
questions are provided in the 'FAQ' section, and help
documentation is also included in the 'Help' section. The
'Latest Update' section provides news about recent
updates of the database.

Figure 1. Overview of BSRD design. The three main characteristics of BSRD are (i) comprehensive data collection from external databases and the
literature, (ii) comprehensive annotation and expression profiles for sRNAs and (iii) a novel RNA-Seq analysis platform, sRNADeep, for
characterizing sRNAs from high-throughput sequencing data.

From the 'Search BSRD' page, users can search for 
sRNAs in BSRD with three options: by sRNA id or 
name, by sRNA class or by genomic position. 
Alternatively, users can examine sRNAs according to
the host organism from the list of bacteria in the
'Hierarchical taxonomy' page. Additionally, users can
go to the 'BLAST BSRD' page to enter or
upload sRNA sequences for a quick search against
BSRD using the BLAST program. Results are sorted
by alignment score, with single-nucleotide variations
highlighted (Supplementary Figure S1). The 'Download' page
provides options for users to download sRNA sequences 
in FASTA format according to the sRNA id or name, the 
bacterial host or batch. The 'Submit' page allows users to 
submit new sRNAs or annotations to BSRD. The 'Latest 
publications' page enables users to access the latest articles 
related to 'bacterial sRNAs' in PubMed and Microsoft 
Academic Research Databases. 
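
Sequences downloaded in FASTA format can be read with a few lines of Python. The sketch below is a generic FASTA reader and not part of BSRD; the function name parse_fasta and the example records are hypothetical, and the header layout of real BSRD downloads is not described here, so only the first whitespace-delimited token of each header is used as the record name.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {record name: sequence} dictionary.

    The record name is taken as the first whitespace-delimited token after '>'.
    Sequence lines belonging to one record are concatenated.
    """
    chunks = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {record: "".join(parts) for record, parts in chunks.items()}


# Hypothetical two-record example (names and sequences are placeholders).
example_fasta = ">srna_demo_1 placeholder description\nACGU\nGG\n>srna_demo_2\nUUAA\n"
example_records = parse_fasta(example_fasta)
```

A dictionary keyed by record name makes it straightforward to look up a single sRNA or to filter records by sequence length before further analysis.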

Regulatory network 

Transcription factors and sRNAs can bind to the same
target, and the clearance rates, steady-state concentrations
and response curves determine the dynamics of
these regulatory networks (32). Regulatory networks in
BSRD are constructed using Cytoscape Web (33)
(Figure 2). Different colours are assigned to the different
elements of the networks: sRNA (yellow), target gene
(orange), sigma factor (red) and transcription factor
(blue). Regulatory relationships are also differentiated
by line pattern: repression (T-shaped) and
induction (arrow).
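
The colour and line conventions above, and the idea of a target shared by an sRNA and a transcription factor, can be illustrated with a small data structure. This is a sketch in plain Python, not the Cytoscape Web schema used by BSRD; the names NODE_COLOURS, EDGE_STYLES and shared_targets are hypothetical, and the regulators and genes in the example are placeholders rather than curated interactions.

```python
# Display conventions taken from the text: node colour by element type,
# line pattern by regulatory effect.
NODE_COLOURS = {
    "sRNA": "yellow",
    "target_gene": "orange",
    "sigma_factor": "red",
    "transcription_factor": "blue",
}
EDGE_STYLES = {"repression": "T-shaped", "induction": "arrow"}


def shared_targets(node_types: dict, edges: list) -> list:
    """Return targets regulated by at least one sRNA and one transcription factor.

    node_types maps node name -> element type (a key of NODE_COLOURS).
    edges is a list of (regulator, target, effect) tuples.
    """
    regulator_kinds = {}
    for regulator, target, effect in edges:
        if effect not in EDGE_STYLES:
            raise ValueError(f"unknown regulatory effect: {effect!r}")
        regulator_kinds.setdefault(target, set()).add(node_types[regulator])
    wanted = {"sRNA", "transcription_factor"}
    return sorted(t for t, kinds in regulator_kinds.items() if wanted <= kinds)


# Placeholder network: a sigma factor induces an sRNA; the sRNA and a
# transcription factor both act on gene_A; only the factor acts on gene_B.
example_nodes = {
    "sigma_X": "sigma_factor",
    "sRNA_1": "sRNA",
    "TF_1": "transcription_factor",
    "gene_A": "target_gene",
    "gene_B": "target_gene",
}
example_edges = [
    ("sigma_X", "sRNA_1", "induction"),
    ("sRNA_1", "gene_A", "repression"),
    ("TF_1", "gene_A", "induction"),
    ("TF_1", "gene_B", "repression"),
]
example_shared = shared_targets(example_nodes, example_edges)
```

Storing the element type per node and the effect per edge keeps the drawing conventions separate from the interaction data, so the same edge list can be rendered by any network viewer.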



sRNADeep 

sRNADeep is a novel platform for sRNA expression 
profiling from RNA-Seq data (Supplementary Figure S2). 
It can not only annotate expressed sRNAs from a single set
of transcriptome data, but also identify differentially
expressed sRNAs between two conditions. sRNADeep
accepts compressed archives of clean reads, that is, raw
reads that have been filtered by adapter removal and quality
trimming. The clean reads are
mapped against the non-redundant sRNA set in BSRD
using the Burrows-Wheeler Aligner (BWA) (34), with a
maximum of one mismatch allowed. The expectation maximization-based SEQ-EM
algorithm (35) is used to handle multi-mapped reads. For
a single dataset, the number of reads for each sRNA is
calculated and normalized using the
reads per kilobase per million mapped reads (RPKM) method (36). For analysis of
two datasets, DESeq (37) is used to identify differentially
expressed sRNAs between the samples.
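
The quantification steps for a single dataset can be sketched as follows. This is a minimal illustration under stated assumptions, not the sRNADeep code: em_expected_counts is a generic expectation-maximization scheme for sharing multi-mapped reads among sRNAs and does not reproduce the SEQ-EM algorithm, and both function names and the example reads are hypothetical. The rpkm function uses the standard definition, reads x 10^9 / (length in nt x total mapped reads).

```python
def em_expected_counts(read_hits: list, n_iter: int = 50) -> dict:
    """Estimate expected read counts per sRNA when some reads map to several sRNAs.

    read_hits holds one list per read, naming the sRNAs that read maps to.
    Generic EM sketch: reads are split among their hits in proportion to the
    current abundance estimates, which are then re-estimated from the result.
    """
    ids = sorted({s for hits in read_hits for s in hits})
    if not ids:
        return {}
    theta = {s: 1.0 / len(ids) for s in ids}  # uniform starting abundances
    counts = {s: 0.0 for s in ids}
    for _ in range(n_iter):
        counts = {s: 0.0 for s in ids}
        # E-step: distribute each read over its hits according to theta.
        for hits in read_hits:
            total = sum(theta[s] for s in hits)
            for s in hits:
                counts[s] += theta[s] / total
        # M-step: update relative abundances from the expected counts.
        n_assigned = sum(counts.values())
        theta = {s: counts[s] / n_assigned for s in ids}
    return counts


def rpkm(read_count: float, length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


# Placeholder data: three reads unique to sRNA 'a', one unique to 'b',
# and one read that maps to both.
example_hits = [["a"], ["a"], ["a"], ["b"], ["a", "b"]]
example_counts = em_expected_counts(example_hits)
example_rpkm = rpkm(50, 100, 1_000_000)
```

In this toy example the shared read is assigned mostly to the sRNA with more unique support, which is the behaviour an EM-based approach is meant to provide; the resulting fractional counts can then be passed to rpkm.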

On job submission, users should provide a valid 
email address for receiving a job ID, which sRNADeep 
assigns, for result retrieval. A typical output of 
sRNADeep includes the length distribution of clean 
reads, the distribution of mapped reads and expression 
levels of sRNAs or differentially expressed sRNAs 
(Supplementary Figure S3). 

DISCUSSION 

Compared with other currently available sRNA-related 
databases, BSRD is more advanced in three aspects. 
First, BSRD hosts the largest collection of sRNAs 
(Table 1). It encompasses 964 experimentally validated 
sRNAs, 8248 sRNA homologs and 507 candidate 
sRNAs from high-throughput datasets. sRNAMap, for
instance, collects only 87 validated sRNAs and 310 sRNA
homologs.

Table 1. Comparison of BSRD with other available resources

Second, BSRD not only provides extensive functional 
descriptions for sRNAs, but also includes multiple new 
sRNA annotations from manually curated literature 
mining, including growth phase, Hfq binding and 
Rho-independent terminators. It also gives access to
large-scale target predictions for identified sRNAs.
We have also integrated information on upstream
sigma factors into the sRNA regulatory networks for a more
comprehensive visualization of regulatory functions.

Third, although recent developments in deep
sequencing technology have advanced sRNA research,
web-based tools for annotating sRNAs from
high-throughput sequencing data are unavailable. We
have thus developed sRNADeep to meet this need. We 
evaluated the performance of sRNADeep with the tran- 
scriptome data of Listeria monocytogenes (38). All 13 dif- 
ferentially expressed sRNAs previously reported were 
successfully recovered by sRNADeep. In addition, nine 
previously uncharacterized sRNAs were also identified 
by sRNADeep. sRNADeep could be a useful tool for 
characterizing sRNAs from deep sequencing data. 

FUTURE DEVELOPMENTS 

We will continue to import information concerning new 
bacterial genomes and update sRNA annotations in 
BSRD. We also welcome submissions of novel sRNAs
or annotations. Because of the expanded use of high- 
throughput deep sequencing, we expect to develop 
functions such as the evaluation of the effects of sRNA 
binding from transcriptome data, and prediction of novel 
sRNAs by an improved version of sRNADeep. 

AVAILABILITY 

BSRD is freely available at http://kwanlab.bio.cuhk.edu.hk/BSRD.
All sRNA sequences are also available
for download in FASTA format. There are no access 
restrictions for academic and commercial use. The 
content of BSRD is freely available under the ODC 
Open Database License. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Table 1 and Supplementary Figures 1-3. 